Abstract


An overview of failure-tolerant control is presented, beginning
with robust control, progressing through parallel and analytical
redundancy, and ending with rule-based systems and artificial
neural networks. By design or implementation, failure-tolerant
control systems are "intelligent" systems. All failure-tolerant
systems require some degree of robustness to protect against 
catastrophic failure; failure tolerance often can be improved by 
adaptivity in decision-making and control, as well as by redun- 
dancy in measurement and actuation. Reliability, maintainability, 
and survivability can be enhanced by failure tolerance, although 
each objective poses different goals for control system design. 
Artificial intelligence concepts are helpful for integrating and 
codifying failure-tolerant control systems, not as alternatives but 
as adjuncts to conventional design methods. 


Introduction 

Many devices depend on automatic control for satisfactory 
operation, and while assuring stability and performance with all 
components functioning properly remains the primary design 
goal, there is increasing need for controlled systems to continue 
operating acceptably following failures in either the system to be 
controlled (the plant) or in the control system itself. 1 A distinc- 
tion should be made between system failures, which occur when 
components break or misbehave, and system faults, which in- 
clude improper design as well. Our attention is directed at the 
former, as improper design is a separate issue. 

Failure-tolerant control systems can be characterized as ro- 
bust, reconfigurable, or some combination of the two. A well- 
designed feedback controller typically reduces the plant’s output 
sensitivity to measurement errors and disturbance inputs; if the 
plant is lightly damped or unstable, it provides closed-loop sta- 
bility as well. It is designed assuming some nominal physical 
structure for the plant, expressed by a mathematical model and a 
set of parameters. A controlled system that retains satisfactory 
performance in the presence of variations from this model with- 
out changes in the control system’s structure or parameters is said 
to be robust. The degree of failure that can be accommodated by 
a fixed control structure is more restricted than that of a variable 
control structure. If the structure or parameters can be altered
following system failure, the control system is reconfigurable.

1 For the purposes of this paper, the plant is defined as a dynamic system
containing components that impart distinctive physical properties like mass,
inertia, elasticity, forces, and moments. The plant's motion (position and
velocity) must be controlled for satisfactory operation. The control system is
an assemblage of additional components (motion sensors, force and moment
actuators, and computers) that provide this service. A controlled system is
a plant plus its control system.

In the latter case, the control system detects, identifies, and 
isolates failures, and it modifies control laws to maintain accept- 
able performance. A system that is failure-tolerant through re- 
configuration is both adaptive and redundant. It is adaptive in its 
ability to adjust to off-nominal behavior, as occurs from loss or 
degradation of sensors, actuators, and power supplies, damage to 
signal and power transmission channels, or unexpected alteration 
of the plant's characteristics. It is redundant in its ability to overcome
lost capabilities with remaining resources. Redundancy can
be provided by similar parallel channels for measurement and 
control, or it may result from flexible logic that synthesizes 
missing measurements or control forces using operable sensors 
and actuators, effectively invoking dissimilar parallel channels. 
A reconfigurable control system must be robust enough to pre- 
clude controlled system failure while adaptation is taking place. 

While there is much debate as to what constitutes true
"machine intelligence," it can be argued that adaptivity and redundancy
are attributes of intelligence and, in the same light, that
feedback control makes use of information in an intelligent fashion.
The issue is not that adaptive, feedback controllers pass the
seminal Turing test [1] or possess "consciousness" [2]. It is that
they exhibit the "ability involved in calculating, reasoning, per- 
ceiving relationships and analogies, learning quickly, storing and 
retrieving information, ... classifying, generalizing, and adjusting 
to new situations," [3] at least in a symbolic or quantitative sense. 
To the extent that symbols and instructions reflect knowledge and 
decisions, a failure-tolerant, feedback control system can be 
called intelligent, and that context is adopted here. 


Controlled Systems 

Attention is focused on the control of continuous-time dy- 
namic systems (or plants) whose motions can be represented by 
integrals of nonlinear ordinary differential equations, 

$$\dot{x}(t) = f[x(t), u(t), w(t), p] \tag{1}$$

where $x(t)$ is the $n$-dimensional state, $u(t)$ is the $m$-dimensional
control, $w(t)$ is an $r$-dimensional disturbance, and $p$ is an $l$-vector
of parameters. The state is observed through the measurement $r$-vector,

$$z(t) = h[x(t), u(t), w(t), n(t), p] \tag{2}$$

where $n(t)$ is an $r$-dimensional measurement-error vector. Along
a nominal trajectory specified by $x_0(t)$, $u_0(t)$, $w_0(t)$, and $n_0(t)$ for
$t$ in $(t_0, t_f)$, perturbations of the state and observation vectors are
governed approximately by linear, time-varying equations,

$$\Delta\dot{x}(t) = F(t)\Delta x(t) + G(t)\Delta u(t) + L(t)\Delta w(t) \tag{3}$$

$$\Delta z(t) = H_x(t)\Delta x(t) + H_u(t)\Delta u(t) + H_w(t)\Delta w(t) + n(t) \tag{4}$$

$F$, $G$, $L$, $H_x$, $H_u$, and $H_w$ are conformable Jacobian matrices
expressing sensitivities to the perturbation variables. At discrete
instants of time, $t_k$, $t_{k+1}$, and so on, the state and measurement
perturbations can be approximated by

$$\Delta x_{k+1} = \Phi_k \Delta x_k + \Gamma_k \Delta u_k + \Lambda_k \Delta w_k \tag{5}$$

$$\Delta z_k = H_{x_k}\Delta x_k + H_{u_k}\Delta u_k + H_{w_k}\Delta w_k + n_k \tag{6}$$



where the subscript "k" indicates evaluation at $t_k$. Here, $\Phi$, $\Gamma$,
and $\Lambda$ have the same dimensions as $F$, $G$, and $L$ and are derived
from the system's state transition properties (e.g., [4]). These
models provide a foundation for the remaining discussion.
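
As an illustration of how the discrete-time matrices of eq. 5 can follow from the continuous-time model of eq. 3, the sketch below computes $\Phi$ and $\Gamma$ for a constant $F$ and $G$, assuming the control is held constant over each sampling interval. The function names, the truncated-series matrix exponential, and the double-integrator example are assumptions made here for illustration; they are not taken from the paper or from [4].

```python
"""Zero-order-hold discretization of dx/dt = F x + G u (pure-Python sketch)."""


def mat_mul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_exp(a, terms=30):
    """Truncated Taylor series for exp(a); adequate for small, well-scaled a."""
    n = len(a)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for p in range(1, terms):
        # term becomes a^p / p!
        term = [[v / p for v in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


def discretize(F, G, T):
    """Return (Phi, Gamma) for sampling interval T, assuming constant F and G
    and a control that is held constant over the interval."""
    n, m = len(F), len(G[0])
    # exp([[F, G], [0, 0]] * T) carries Phi and Gamma in its top block row.
    aug = [[F[i][j] * T for j in range(n)] + [G[i][j] * T for j in range(m)]
           for i in range(n)]
    aug += [[0.0] * (n + m) for _ in range(m)]
    e = mat_exp(aug)
    phi = [row[:n] for row in e[:n]]
    gamma = [row[n:] for row in e[:n]]
    return phi, gamma


if __name__ == "__main__":
    # Hypothetical plant: double integrator (position, velocity) driven by force.
    F = [[0.0, 1.0], [0.0, 0.0]]
    G = [[0.0], [1.0]]
    Phi, Gamma = discretize(F, G, T=0.1)
    print(Phi)    # state transition over one sample
    print(Gamma)  # control effect over one sample
```

For time-varying $F(t)$ and $G(t)$ the same idea applies interval by interval, with the matrices evaluated along the nominal trajectory.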

Control logic for the nonlinear plant (eq. 1 and 2) typically
takes the form of a dynamic compensator,

$$\Delta u_k = -C_k \xi_k \tag{7}$$

$$\xi_{k+1} = \Psi_k \xi_k + \Theta_k \Delta u_k + K_k[z_k - h(\hat{x}_k, u_k)] \tag{8}$$

$$\xi_k = [\Delta\hat{x}_k^T \;\; \chi_k^T]^T \tag{9}$$

$$u_k = u_{0_k} + \Delta u_k \tag{10}$$

$$\hat{x}_k = x_{0_k} + \Delta\hat{x}_k \tag{11}$$

This linear, time-varying structure exemplifies estimation and
control functions for discussion purposes, but more complex
structures (particularly nonlinear ones) may be employed. It is
equivalent to a feedback control law (eq. 7) operating on the internal
state estimate $\Delta\hat{x}$ contained in the $(n + \ell)$-dimensional $\xi_k$
(eq. 8). $\chi_k$ is an $\ell$-vector of compensation components, such as
integrals of state elements. The control and estimation gains, $C_k$
and $K_k$, are selected to provide satisfactory nominal response and
may vary in time. $\Psi_k$ and $\Theta_k$ normally represent nominal values
of $\Phi_k$ and $\Gamma_k$ plus integrating (i.e., accumulating) or filtering
operations associated with $\chi_k$. The desired state and corresponding
control for the nonlinear plant, $x_{0_k}$ and $u_{0_k}$, enter as in eq. 10
and 11.
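
To make the compensator structure concrete, the following sketch closes the loop around a scalar, open-loop-unstable plant using an estimator-based law patterned on eq. 7 and 8. It is a minimal illustration under strong assumptions: the compensator state contains only the state estimate (no additional compensation components), the measurement is noise-free with unit sensitivity, and all numerical values (`phi`, `gamma`, `C`, `K`) are hypothetical.

```python
def simulate(phi=1.1, gamma=1.0, C=0.6, K=0.8, dx0=1.0, steps=40):
    """Simulate dx_{k+1} = phi*dx_k + gamma*du_k under estimator feedback.

    Returns the magnitude of the state perturbation at each step.
    """
    dx = dx0        # true state perturbation (unknown to the controller)
    dx_hat = 0.0    # compensator's internal state estimate
    history = []
    for _ in range(steps):
        dz = dx                              # assumed perfect measurement
        du = -C * dx_hat                     # eq. 7: feedback on the estimate
        residual = dz - dx_hat               # measurement residual
        dx_hat_next = phi * dx_hat + gamma * du + K * residual  # eq. 8
        dx = phi * dx + gamma * du           # plant, eq. 5 without disturbance
        dx_hat = dx_hat_next
        history.append(abs(dx))
    return history


if __name__ == "__main__":
    print(simulate()[-1])        # closed loop: perturbation decays
    print(simulate(C=0.0)[-1])   # control gain zeroed: perturbation grows
```

With these values the estimation error contracts by the factor `phi - K` and the estimate by `phi - gamma*C` each step, so both gains must be chosen with the plant model in mind; zeroing the control gain leaves the unstable plant uncontrolled.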

Figure 1 represents an idealized controlled system, with dis- 
turbance and noise inputs not shown. While the figure identifies 
the elements of nominal control system design, it provides little 
insight about control system components, all of which may fail. 
Tangible components are needed for measurement and actuation 
(Fig. 2), the control logic described by eq. 7 to 11 is executed in
a computer, these components are enabled by a power supply, 
and the power supply also is subject to failure. An ancillary issue 
is that sensors and actuators — themselves physical systems — 
have scale factors, biases, and dynamic characteristics to be con- 
sidered during failure detection and identification. The simplest 
means of doing this is to incorporate these characteristics in the 
plant model (eq. 1 and 2), with estimation and control logic
modified accordingly. 


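Figure 1. Idealization of a controlled system. (Blocks: controller, plant, and estimator. Signals: desired state response, control input u(t), observation z(t), and state estimate.)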



Figure 2. Components of a controlled system. 


Objectives and Issues for Failure-Tolerant 
Control 

Failure tolerance may be called upon to improve system reli- 
ability, maintainability, and survivability. The requirements for 
failure tolerance are different in these three cases. Reliability 
deals with the ability to complete a task satisfactorily and with the 
period of time over which that ability is retained. A control sys- 
tem that allows normal completion of tasks after component fail- 
ure improves reliability. Maintainability concerns the need for 
repair and the ease with which repairs can be made, with no 
premium placed on performance. Failure tolerance could increase 
time between maintenance actions and allow the use of simpler 
repair procedures. Survivability relates to the likelihood of con- 
ducting an operation safely (without danger to human operators 
or the controlled system), whether or not the task is completed. 
Degraded performance following failure might be permitted, as 
long as the system can be brought to an acceptable state of rest. 

Improving the reliability of individual components clearly 
helps in all three categories; however, it does not follow that what 
aids one objective aids another. For example, replacing a single 
string of control system components by three parallel strings of 
identical components (plus selection or averaging logic) may im- 
prove reliability, but it also increases the likelihood of component 
failures, degrading maintainability. Conversely, redundancy 
within line-replaceable units (LRUs) could improve maintainabil- 
ity if it allows LRUs to be changed less often. Adding a separate 
string of less-capable components may improve survivability 
without improving reliability while decreasing maintainability. 

The principal categories of failure are plant alterations, actuator
and sensor failures, computer failure, and power supply/transmission
failure. Actuators, sensors, and other analog
components are subject to many failure types, some of which 
may be subtle but nonetheless damaging: parameter variation, 
abrupt or random bias shift, abrupt or random scale factor shift, 
change in saturation limits, drift, open circuit, hardover (or 
stuck), and noise. Digital computer hardware failures have entirely
different characteristics, but it can be argued that they are
never subtle, as internal clock rates are high and the loss of coherent
output is obvious [5]. Computer software does not fail
per se, but it is susceptible to programming faults that may sur- 
face unexpectedly and that may be hard to detect. Multiple fail- 
ures can occur, particularly as a consequence of physical damage, 
and they may be intermittent; hence, reconfiguration logic must 
do more than just accommodate isolated failures. While not 
strictly system failures, operator blunders and power transients 
may produce system states that require prompt response. 

Many factors must be considered in designing failure-tolerant 
controls, including: allowable performance degradation in the 
failed state, criticality and likelihood of the failure, urgency of response
to failure, tradeoffs between correctness and speed of response,
normal range of system uncertainty, disturbance environment,
component reliability vs. redundancy, maintenance
goals (mean-time-between-failures, mean-time-to-failure, mean-time-to-repair,
maintenance-hours/operation-hours, etc.), size
and cost of LRUs, system architecture, limits of manual inter- 
vention, and life-cycle costs. Assessing each of these factors re- 
quires detailed knowledge of the plant and its control objectives. 


Robust Control 

Controlled system robustness is the ability to maintain satis- 
factory stability and performance in the presence of parameter 
variations, which could be due to component failures in either the 
plant or the control system. All practical controlled systems must 
possess some degree of robustness against operational parameter
variations. Maintaining stability with component failures is a 
particular challenge when the plant is open-loop-unstable, as 
control-system failure may mean that the system becomes partially
"open-loop." Alternatively, a plant alteration (e.g., the
breaking of a stabilizing spring or the loss of an aircraft's stabiliz- 
ing surface) may force an ordinarily stable system to become un- 
stable. In either case, reconfiguration may offer the only re- 
course for stable control. It also is possible for an open-loop- 
stable plant to be destabilized by a feedback controller with failed 
control loops [6]. This lack of robustness is most likely to occur
in high-gain controllers, where open- and closed-loop dynamics 
are substantially different; robustness recovery typically requires 
lowering the control gains in systematic fashion [4,6,7]. The inherent
stability margins of certain algebraic control laws (e.g., the
linear-quadratic (LQ) regulator [4,8-10]) may become vanishingly
small when dynamic compensation (e.g., the estimator in a
linear-quadratic-Gaussian (LQG) regulator) is added [11]. 
Restoring the robustness to that of the LQ regulator typically re- 
quires increasing estimator gains using the loop-transfer-recovery 
method [4,12].

Subjective judgments have to be made in assessing the need 
for robustness and in establishing corresponding control system 
design criteria, as there is an inevitable tradeoff between robust- 
ness and nominal system performance [13]. The designer must 
know the normal operating ranges and distributions of parameter 
variations, as well as the specifications for system operability 
with failed components, else the final design may afford too little 
robustness for possible parameter variations or too much robustness
for satisfactory nominal performance. Robustness traditionally
has been assessed deterministically [14]; it is an inherent
part of the classical design of single-input/single-output systems,
and there are multi-input/multi-output equivalents based on singu- 
lar-value analysis of various frequency-domain matrices [e.g., 
4,10,12,15]. The most critical difficulty in applying these tech- 
niques is relating singular-value bounds on return-difference and 
inverse-return-difference matrices to real parameter variations in
the controlled system. 

There is increasing interest in statistical alternatives that make 
full use of knowledge about potential system variations and that 
work directly with real parameter variations. The probability of 
instability was introduced in [16] and is further described in 
[17,18]. This method determines the stochastic robustness of a
linear, time-invariant system by the probability distributions of 
closed-loop eigenvalues, given the statistics of the variable pa- 
rameters in the controlled system's dynamic model. The proba- 
bility that any of these eigenvalues have positive real parts is the 
scalar measure of robustness, a figure of merit to be minimized 
by control system design. Extensions to the analysis of perfor- 
mance robustness and of nonlinear, time-varying systems are di- 
rect. This approach provides logical connections to reliability 
analysis of control systems, discussed below. 
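
The probability-of-instability measure can be estimated by random sampling, as the following sketch suggests. It is not the procedure of [16-18]; it is a minimal Monte Carlo illustration that assumes a hypothetical second-order plant, $\ddot{x} + a_1\dot{x} + a_0 x = b\,u$, with invented parameter distributions and a fixed state-feedback law $u = -k_1 x - k_2 \dot{x}$.

```python
import cmath
import random


def closed_loop_eigenvalues(a0, a1, b, k1, k2):
    """Roots of s^2 + (a1 + b*k2) s + (a0 + b*k1) = 0."""
    c1 = a1 + b * k2
    c0 = a0 + b * k1
    disc = cmath.sqrt(c1 * c1 - 4.0 * c0)
    return (-c1 + disc) / 2.0, (-c1 - disc) / 2.0


def probability_of_instability(k1, k2, samples=2000, seed=0):
    """Fraction of sampled plants with a closed-loop eigenvalue whose real
    part is positive. The parameter statistics below are assumptions."""
    rng = random.Random(seed)
    unstable = 0
    for _ in range(samples):
        a0 = rng.gauss(-1.0, 0.5)    # stiffness term, nominally destabilizing
        a1 = rng.gauss(0.2, 0.1)     # damping term
        b = rng.uniform(0.5, 1.5)    # control effectiveness
        eigs = closed_loop_eigenvalues(a0, a1, b, k1, k2)
        if any(ev.real > 0.0 for ev in eigs):
            unstable += 1
    return unstable / samples


if __name__ == "__main__":
    print(probability_of_instability(k1=3.0, k2=2.0))  # with feedback
    print(probability_of_instability(k1=0.0, k2=0.0))  # no feedback
```

A design procedure would adjust the gains to reduce this estimate while checking nominal performance; the number of samples controls the confidence that can be placed in the estimate.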

It is easy to pose unreachable or irrelevant goals for control 
robustness. Problems that must be addressed in robust control 
system design include: retaining controllability and observability 
following component failure, achieving satisfactory off-design 
performance (including steady-state and tracking response as well 
as stability), minimizing compromises to on-design performance, 
and relating robustness criteria to real component failures. 


Parallel Redundancy 

In principle, tolerance to control system failures can be im- 
proved if two or more strings of sensors, actuators, and comput- 
ers, each separately capable of satisfactory control, are imple- 
mented in parallel (Fig. 3). A voting scheme is used for redun- 
dancy management, comparing control signals to detect and over- 
come failures. With two identical channels, a comparator can 
determine whether or not control signals are identical; hence, it 
can detect a failure but cannot identify which string has failed. 
Using three identical channels, the control signal with the middle 
value can be selected (or voted), assuring that a single failed 
channel never controls the plant. A 2-channel system is consid- 
ered fail-safe because the presence of a failure can be determined, 
but it is left to additional in-line (or "built-in test") logic to select 
the unfailed channel for control. The 3-channel system is fail-op- 
erational, as the task can be completed following a single failure. 
Systems with four identical control channels are called "fail- 
op/fail-op" because they can tolerate two failures and still yield 
nominal performance. In any voting system, it remains for addi- 
tional logic to declare unselected channels failed. Given the vec- 
torial nature of control, this declaration may be equivocal, as 
middle values of control-vector elements can be drawn from dif- 
ferent strings. 
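
The selection logic described above can be sketched in a few lines. The function names and the tolerance are hypothetical, and the control signals are treated as plain numbers; the vector case shows how the voted elements may come from different strings, which is why a failure declaration can be equivocal.

```python
def compare_two(u1, u2, tolerance):
    """Two-channel comparator: True if the channels disagree by more than the
    tolerance. It reveals that a failure exists, not which channel failed."""
    return abs(u1 - u2) > tolerance


def mid_value_select(u1, u2, u3):
    """Three-channel voter: return the middle of three scalar signals, so a
    single failed channel never drives the plant."""
    return sorted((u1, u2, u3))[1]


def vote_vector(c1, c2, c3):
    """Element-by-element mid-value selection for control vectors."""
    return [mid_value_select(a, b, c) for a, b, c in zip(c1, c2, c3)]


if __name__ == "__main__":
    print(compare_two(1.0, 50.0, tolerance=0.1))   # disagreement detected
    print(mid_value_select(1.0, 1.02, 50.0))       # hardover channel rejected
    print(vote_vector([1.0, 2.0], [1.1, 9.0], [7.0, 2.1]))
```

In the last call the first voted element is taken from the second string and the second element from the third string, although each of those strings also carries an outlying element.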

Of course, the voting logic itself has some probability of fail- 
ure, and a single-point failure of a voting component could be 
catastrophic. Consequently, it may be preferable to let each 
channel remain independent through the application of control 
force, letting force averaging mediate failures. If control outputs
are averaged, small variations among the parallel channels tend to 
cancel, and the net output is smooth; however, a runaway failure 
can bias the net signal away from its desired value. Voting and 
isolation of failed channels then can be carried out as an auxiliary 
process whose own failure would not disable the entire system. 
Once a failed channel has been disengaged, the total available
control force is reduced, changing the performance characteristics
of the controlled system.

Figure 3. A triply redundant controlled system.

For perfect output voting of $M$ identical parallel channels, each
with $N$ serial components, the failure probability $P_f$ of the overall
control system is

$$P_f = \prod_{j=1}^{M}\left[1 - \prod_{i=1}^{N} e^{-\lambda_i (t_f - t_0)}\right] = (1 - R)^M \approx [(\lambda_s + \lambda_c + \lambda_a)(t_f - t_0)]^M \tag{12}$$

Sensor, computer, and actuator failure rates 1 are $\lambda_s$, $\lambda_c$, and $\lambda_a$
(assumed to be small and uncorrelated), $(t_f - t_0)$ is the mission duration,
and $R$ is the single-string reliability [19,20]. If the components
can be cross-strapped perfectly (i.e., if a failed component
from one string can be connected to an unfailed string), the
overall probability of failure is reduced to

$$P_f = 1 - \prod_{i=1}^{N}\left[1 - \prod_{j=1}^{M}\left(1 - e^{-\lambda_i (t_f - t_0)}\right)\right] \approx (\lambda_s^M + \lambda_c^M + \lambda_a^M)(t_f - t_0)^M \tag{13}$$

1 In the present context, "sensor" implies the entire suite of sensors needed
for control, and "computer" and "actuator" are defined similarly.

Unfortunately, failures cannot be detected perfectly, and cross-strapping
itself is subject to failure. The probability of detecting,
isolating, and recovering from a failure (called coverage) is a
more meaningful measure than $P_f$. For a 3-channel control system
with output voting alone, the coverage $C$ [21], or net reliability, is

$$C = R^3 + 3R^2(1 - R)P_{r_1} + 3(1 - R)^2 R P_{r_2} \tag{14}$$

where $P_{r_1}$ is the probability of recovering from the first failure
and $P_{r_2}$ is the probability of recovering from a second failure.
These probabilities are not necessarily the same, as different processes
may be used for failure detection: voting for the first failure,
in-line detection for the second. Unless the recovery probabilities
are very nearly one, the maximum benefits of redundancy
will not be realized.

Problems encountered in implementing parallel redundancy
include: selection logic, nuisance trips, generic failures, reliability
of voting/selection units, control force contention, cross-strapping,
increased cost and maintenance, number of operating
channels required for dispatch, and connectors. Failure-detection
logic must be sensitive to failures yet insensitive to small operational
errors, including those due to non-colocation of sensors or
actuators. Nuisance trips (false indications of failure) must be
minimized to assure that useful resources are kept on-line and
missions are not aborted prematurely. Redundancy does not
preclude identical damage to parallel systems, especially when
they are located in close proximity. Cross-strapping implies
complex, "intelligent" interconnections; however, if it is not implemented,
a single component failure brings down an entire
control string. Voting can be done in all operating control computers,
but arbitration is required when these computers disagree.
For the ideal parallel system, the probability $P_c$ that some component
will fail is

$$P_c = M[(\lambda_s + \lambda_c + \lambda_a)(t_f - t_0)] \tag{15}$$

so the likelihood of component failure is increased by redundancy.
It is necessary to establish rules for dispatching the controlled
system: if one control string is not operational but the
others are, should the process be initiated? For a manufacturing
system, the answer might be "yes," while for a transport aircraft,
it might be "no." A non-trivial aspect of redundant control is the
need for more electrical connectors, the components most likely
to cause trouble!

One insidious problem associated with parallel redundancy is
the lack of controllability of internal state components [22].
Consider the dual-redundant controlled system of Fig. 4, where
the individual control outputs are averaged by $M_1 = M_2$, and
$F_1 = F_2$, $G_1 = G_2$, and $N_1 = N_2$. The dynamic equations can be
expressed as

$$\begin{bmatrix} \dot{x}_A \\ \dot{x}_1 \\ \dot{x}_2 \end{bmatrix} = \begin{bmatrix} F_A & G_A M_1 & G_A M_1 \\ G_1 N_1 & F_1 & 0 \\ G_1 N_1 & 0 & F_1 \end{bmatrix} \begin{bmatrix} x_A \\ x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 0 \\ G_1 \\ G_1 \end{bmatrix} u_p \tag{16}$$

The controllability matrix $\mathcal{C}$ of this system is

$$\mathcal{C} = \begin{bmatrix} 0 & 2G_A M_1 G_1 & 2(F_A G_A M_1 + G_A M_1 F_1)G_1 & \cdots \\ G_1 & F_1 G_1 & (2G_1 N_1 G_A M_1 + F_1^2)G_1 & \cdots \\ G_1 & F_1 G_1 & (2G_1 N_1 G_A M_1 + F_1^2)G_1 & \cdots \end{bmatrix} \tag{17}$$

Complete controllability requires that $\mathcal{C}$ be of maximal rank;
however, that is not possible because the bottom two rows are
repeated. In other words, the compensator state elements are not
controllable. If the corresponding modes are stable, then small
variations between the two controllers tend to decay; however, if
the modes are unstable or neutrally stable (as in the case of integral
compensation), uncontrollable drift can occur, leading to divergent
control outputs, nuisance trips, and possible isolation of
otherwise operable channels.

Figure 4. Model of a dual-redundant controller.
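
The rank deficiency of the dual-redundant system can be checked numerically. The sketch below builds the system of eq. 16 with scalar blocks, forms the controllability matrix, and computes its rank in exact rational arithmetic. The scalar values are hypothetical (an integral compensator, $F_1 = 0$, is assumed), and the helper names are introduced here for illustration only.

```python
from fractions import Fraction


def mat_vec(a, v):
    """Product of a matrix (list of rows) and a vector."""
    return [sum(a[i][j] * v[j] for j in range(len(v))) for i in range(len(a))]


def rank(rows):
    """Matrix rank by Gaussian elimination with exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def controllability_rows(A, B):
    """Rows of [B, AB, A^2 B, ...] for a single-input system."""
    cols = [B]
    for _ in range(len(A) - 1):
        cols.append(mat_vec(A, cols[-1]))
    return [[col[i] for col in cols] for i in range(len(A))]


def dual_redundant(F_A, G_A, M1, F1, G1, N1):
    """Scalar-block version of eq. 16 with two identical controllers."""
    A = [[F_A, G_A * M1, G_A * M1],
         [G1 * N1, F1, 0],
         [G1 * N1, 0, F1]]
    B = [0, G1, G1]
    return A, B


if __name__ == "__main__":
    # Hypothetical values: stable plant, averaged outputs, integral compensators.
    A, B = dual_redundant(F_A=-1, G_A=1, M1=Fraction(1, 2), F1=0, G1=1, N1=-1)
    rows = controllability_rows(A, B)
    print(rows)        # the second and third rows are identical
    print(rank(rows))  # less than the state dimension of 3
```

Because the two compensator rows are driven identically, no input can change the difference between the compensator states, which is the drift mechanism described above.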
If there are sufficient cues to warn a human operator of system
failure and plausible failure effects are slow enough to allow
manual intervention, many of the benefits of parallel redundancy 
can be obtained by operating with a single control string, keeping 
an idle backup control string at the ready. The backup system 
can be similar or dissimilar to the primary system; however, if it
is less capable, ability to perform the task will be degraded. 

Parallel redundancy can protect against control-system com- 
ponent failures, but it does not address failures of plant compo- 
nents. Analytical redundancy provides a capability to improve 
tolerance to failures of both types. It does this with fewer addi- 
tional components, flexible cross-strapping, and increased com- 
putation; as a consequence, there is greater reliance on the control 
computer, producing even greater need for computer reliability. 
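
Returning to the reliability expressions for parallel redundancy (eq. 12 to 15), a short numerical sketch can show the tradeoff they describe. It assumes constant, independent failure rates and reads eq. 13 as requiring all copies of some component type to fail before the cross-strapped system fails; the failure rates, mission duration, and function names are hypothetical.

```python
import math


def single_string_reliability(rates, duration):
    """R: probability that every serial component of one string survives."""
    return math.exp(-sum(rates) * duration)


def p_fail_voted(rates, duration, channels):
    """Eq. 12: perfect output voting fails only if all strings fail."""
    return (1.0 - single_string_reliability(rates, duration)) ** channels


def p_fail_cross_strapped(rates, duration, channels):
    """Eq. 13 as read here: failure requires that every copy of at least one
    component type has failed."""
    survive = 1.0
    for lam in rates:
        survive *= 1.0 - (1.0 - math.exp(-lam * duration)) ** channels
    return 1.0 - survive


def p_some_component_fails(rates, duration, channels):
    """Eq. 15: small-rate estimate that any component in any string fails."""
    return channels * sum(rates) * duration


def coverage_three_channel(R, p_r1, p_r2):
    """Eq. 14: net reliability of a 3-channel system with imperfect recovery."""
    return R**3 + 3 * R**2 * (1 - R) * p_r1 + 3 * (1 - R)**2 * R * p_r2


if __name__ == "__main__":
    rates = [1e-4, 1e-4, 1e-4]   # sensor, computer, actuator (per hour, assumed)
    hours = 10.0
    R = single_string_reliability(rates, hours)
    print(p_fail_voted(rates, hours, 3))
    print(p_fail_cross_strapped(rates, hours, 3))
    print(p_some_component_fails(rates, hours, 3))
    print(coverage_three_channel(R, 0.99, 0.90))
```

The first two results fall as channels are added, while the third rises, which is the maintainability penalty noted in the text; the last shows how recovery probabilities below one erode the benefit.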


Analytical Redundancy 

The principal functions of analytical redundancy are failure 
detection (through built-in-test alarms or off-nominal operation), 
failure identification (recognition of which components are 
failed), and control-system reconfiguration (adaptation to sensed 
or estimated failures). Detection and identification may be com- 
bined in built-in test functions. Although in-line monitors pro- 
vide direct and rapid response to specific failures, it is impossible 
to provide full coverage of all failures by specialized instrumenta- 
tion (which itself is subject to failure). A practical failure detec- 
tion, identification and reconfiguration (FDIR) solution can be 
found in the control computer's ability to compare expected re- 
sponse to actual response, inferring component failures from the 
differences and changing either the structure or the parameters of 
the control system as a consequence. 

Failure detection is exemplified by the generalized likelihood 
ratio test (Fig. 5) [23], which uses a Kalman-filter-like recursive 
equation to sense discrepancies in system response. The test 
compares the probability of the estimator's actual measurement 
residual [z - h(·)] with its expected value, detecting a jump that
can be related to failure. It is very sensitive to off-nominal per- 
formance and is easy to implement; however, the test does not 
produce a tight indication of the failed element, and modeling er- 
rors can hamper detection [24]. 
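
The flavor of the test can be conveyed with a much-simplified sketch. It assumes that the estimator residuals are scalar, white, and Gaussian with known standard deviation, and that a failure appears as a constant jump in their mean at an unknown time; the full test of [23] instead propagates failure signatures through the filter. The window length, threshold, and function names are assumptions made for illustration.

```python
def glr_statistic(residuals, sigma, window):
    """Twice the log-likelihood ratio for a mean jump of unknown size,
    maximized over candidate onset times within the most recent window."""
    best = 0.0
    total = 0.0
    recent = residuals[-window:]
    # Walk backward from the newest residual; n is the assumed number of
    # samples since the jump occurred.
    for n, r in enumerate(reversed(recent), start=1):
        total += r
        best = max(best, total * total / (n * sigma * sigma))
    return best


def detect_failure(residuals, sigma=1.0, window=10, threshold=20.0):
    """Index of the first sample at which the statistic exceeds the threshold,
    or None if it never does."""
    for k in range(1, len(residuals) + 1):
        if glr_statistic(residuals[:k], sigma, window) > threshold:
            return k - 1
    return None


if __name__ == "__main__":
    # Deterministic stand-in for residual noise, then a jump at sample 30.
    noise = [0.5 * (-1) ** k for k in range(60)]
    failed = [r + (3.0 if k >= 30 else 0.0) for k, r in enumerate(noise)]
    print(detect_failure(noise))    # no alarm expected
    print(detect_failure(failed))   # alarm shortly after sample 30
```

Raising the threshold reduces nuisance alarms at the cost of detection delay, and an unmodeled bias in the residuals would raise the statistic just as a failure does, which is the modeling-error sensitivity noted above.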


Figure 5. Failure detection: generalized likelihood ratio test.

Failure identification may require a more specific test, such as
multiple-model hypothesis testing (Fig. 6) [25,26]. Each failure
hypothesis (including that of no failure) is modeled in a Kalman
filter, and the most likely hypothesis (based on probability estimates
[4]) indicates the failure state. This is a computationally
intensive technique, as not only the failed device must be hypothesized
but the type, magnitude, and (if taken to the extreme) even
the time of the failure must be modeled as well.

Figure 6. Failure identification: multiple-model hypothesis test.

Consider a modified form of the generic control structure:

$$\Delta u_k = -S_c[C_k \xi_k] + b_c \tag{18}$$

$$\xi_{k+1} = \Psi_k \xi_k + \Theta_k \Delta u_k + K_k[S_s z_k - h(\hat{x}_k, u_k) + b_s] \tag{19}$$

$S_s$ and $S_c$ are scale-factor matrices on the measurements and
control, and $b_s$ and $b_c$ are bias vectors. Within this framework,
we can identify the elements of the control system that need to be
modified following various failures, as in Table 1. If the plant is
altered, it may be necessary to change the internal model ($\Psi$, $\Theta$),
as well as the estimation and control gains ($K$, $C$), and so on for
the remaining failure types. Precise failure identification is an
important antecedent of control reconfiguration. Both "hard"
(fast) and "soft" (slow) failures must be expected, and logic must
accommodate command inputs (set-point transients), disturbances,
and measurement noise [27].


Table 1

Failure Types and Related Control-Law Parameters

Failure                    Parameter

Plant Alteration           $\Psi$, $\Theta$, $K$, $C$
Actuator Failure           $u$, $\Theta$, $C$
Sensor Failure             $z$, $h$, $K$
Bias Shift                 $b_s$ or $b_c$
Scale Factor Shift         $S_s$ or $S_c$
Saturation Limit Change    $K$ or $C$
Drift                      $b_s$ or $b_c$
Open Circuit               $u$, $\Theta$, $C$, and/or $z$, $h$, $K$
Hardover/Stuck             Open Circuit, plus $b_s$ and/or $b_c$
Noise                      $K$
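
Since reconfiguration depends on identifying which failure has occurred, it may help to see the probability update behind multiple-model hypothesis testing in its simplest form. In the sketch below each hypothesis merely predicts a fixed sensor reading, whereas in practice each would be a Kalman filter producing its own predicted measurement; the hypothesis names, predicted values, and noise level are all hypothetical.

```python
import math


def gaussian_likelihood(residual, sigma):
    """Probability density of a scalar residual under zero-mean Gaussian noise."""
    return (math.exp(-0.5 * (residual / sigma) ** 2)
            / (sigma * math.sqrt(2.0 * math.pi)))


def update_probabilities(priors, predictions, measurement, sigma):
    """One Bayes step: weight each hypothesis by the likelihood of its
    residual, then renormalize so the probabilities sum to one."""
    weighted = {
        name: priors[name]
        * gaussian_likelihood(measurement - predictions[name], sigma)
        for name in priors
    }
    total = sum(weighted.values())
    return {name: w / total for name, w in weighted.items()}


def identify(measurements, predictions, sigma):
    """Run the update over a measurement sequence from equal priors."""
    probs = {name: 1.0 / len(predictions) for name in predictions}
    for z in measurements:
        probs = update_probabilities(probs, predictions, z, sigma)
    return probs


if __name__ == "__main__":
    # Assumed true quantity is 1.0; each hypothesis predicts the sensor output.
    predictions = {"no failure": 1.0, "bias shift": 3.0, "open circuit": 0.0}
    measurements = [3.1, 2.9, 3.05, 2.95, 3.0]
    probs = identify(measurements, predictions, sigma=0.5)
    print(max(probs, key=probs.get))
    print(probs)
```

The computational burden noted in the text appears here as the size of the `predictions` dictionary: every failure type, magnitude, and onset time that is to be distinguished needs its own entry and, in a real system, its own filter.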


Reconfiguration attempts to retain nominal stability and per- 
formance characteristics. At a minimum, this requires that on- 
design controllability and observability (e.g., [4]) be preserved. 
There is a tradeoff between speed of reconfiguration, computer 
storage requirements, and flexibility of reaction. Controller 
structures and parameters for all conceivable failed states can be
generated off-line and stored for eventual use; however, this approach
could require an enormous memory. Conversely, on-line
design requires minimal storage and (in principle) can adjust to 
unanticipated failures, but design algorithms must be executed 
and their results accepted soon enough to provide sufficient fail- 
ure tolerance. With failed sensors, reconstruction of missing 
measurements may increase state-estimate errors; with failed effectors,
the remaining actuators may have to operate with larger
displacements and rates [28]. If the plant is open-loop-unstable,
higher control activity combined with existing control-saturation 
limits may reduce the state space within which closed-loop stability
can be assured [29,30].


Artificial Intelligence 

Control theory and artificial intelligence both strive to harness 
mathematics and logic for practical problem solving, but control 
theory finds its origins in dynamics and electronics, while artifi- 
cial intelligence springs from biology, psychology, and computer 
science. Failure-tolerant control systems can benefit from 
blending these perspectives. Two approaches have been fol- 
lowed in the field of artificial intelligence. Artificial neural net- 
works are motivated by input-output and learning properties of 
living neural networks, although in application the network be- 
comes an abstraction that may bear little resemblance to its bio- 
logical namesake. Expert systems mimic the intelligent functions 
of an expert or group of experts. Initially, artificial neural networks
appeared impractical because computers of the day were
too slow and massive, and methods for training neural networks 
(e.g., perceptrons and adalines) were thought to be unworkable
[31,32]. In the intervening years, the expert system approach
proved to be quite achievable; hence, it received major emphasis 
in both theoretical development and applications. New insights 
about learning and improved electronics have restored interest in 
neural networks. 


Expert systems are computer programs that use heuristic rela- 
tionships and facts as human experts do. The tasks and require- 
ments of such systems (Table 2 [33]) are important for reconfig- 
urable control systems, but there is a need to go beyond the usual 
limitations of static expert systems. Interpretation, diagnosis, 
monitoring, prediction, planning, and design must be cyclical, 
dynamic processes that can reconfigure the control system in 
"real time" (i.e., with negligible delay). 

Table 2

Functions of an Expert System

Task              Requirements

Interpretation    Correct, consistent, complete analysis of data
Diagnosis         Fault finding
Monitoring        Recognition of alarm conditions
Prediction        Reasoning about time, forecasting the future
Planning          Defining actions to achieve goals
Design            Creating objects that satisfy requirements


The expert system offers a useful formalism for failure-toler- 
ant control because it can consider diverse data sources and sub- 
problem abstractions. The expert system can combine qualitative 
and quantitative reasoning, heuristics and statistics [34]. Failure
indicators may be continuous variables generated by measure- 
ments or estimators, or they may be discrete variables from in- 
line monitors or discrete-event models. Indicators are the outputs 
of productions, routines 1 with unique input-output characteristics
that produce goal conditions from initial conditions. Hence, the
expert system can be implemented as a production system or a 
rule-based system consisting of a data base, a rule base, and a 
rule interpreter (or inference engine) [35]. A production system 
generates actions predicated on the data base, which contains 
measurements as well as stored data or operator inputs. 

A rule-based failure-tolerant control system contains FDIR 
logic in expert-system format (Fig. 7). The expert system is an
adjunct to the nominal control structure, which remains the most 
efficient means of effecting precise control. From the control 
perspective, the expert system performs its decision-making tasks 
in a concentric outer loop; from the expert-system perspective,
control activity is a side effect that supports decision making. 



Figure 7. Expert-system approach to analytical redundancy. 

An expert system performs deduction using knowledge and 
beliefs expressed as parameters and rules (Fig. 8). Parameters 
have values that either are external to the expert system or are set 
by rules. An "IF-THEN" rule evaluates a premise by testing values
of one or more parameters related by logical "ANDs" or "ORs," as
appropriate, and it specifies an action that sets values of
one or more parameters. The rule base contains all the rules of 
the expert system, and the inference engine performs its function 
by searching the rule base. Given a set of premises (evidence of 
the current state), the logical outcome of these premises is found 
by a data-driven search (forward chaining) through the rules. 
Given a desired or unknown parameter value, the premises 
needed to support the fixed or free value are identified by a goal-directed
search (backward chaining) through the rules. Querying
(or firing) a rule when searching in either direction may invoke 
procedures that produce parameter values as side effects. 
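
The two search directions can be illustrated with a minimal sketch. The rule base below is hypothetical (the parameter names `residual_large`, `sensor_ok`, `actuator_fault`, and `reconfigure` are not from the cited work), premises are restricted to logical ANDs, and the rules are assumed to be acyclic so that the recursive search terminates.

```python
# Hypothetical rule base: each rule is (premise parameters, concluded parameter).
# A premise is true when ALL of its parameters are true (logical AND).
RULES = [
    ({"residual_large", "sensor_ok"}, "actuator_fault"),
    ({"actuator_fault"}, "reconfigure"),
]


def forward_chain(facts, rules=RULES):
    """Data-driven search: fire rules until no new parameter can be set."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def backward_chain(goal, facts, rules=RULES):
    """Goal-directed search: can `goal` be supported by the given facts?"""
    if goal in facts:
        return True
    for premise, conclusion in rules:
        if conclusion == goal and all(
            backward_chain(p, facts, rules) for p in premise
        ):
            return True
    return False
```

Forward chaining starts from the evidence and returns everything that follows from it; backward chaining starts from one parameter of interest and examines only the rules that could set it.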


Figure 8. Graphical representation of expert system knowledge.

Both search directions are used in a rule-based control system 
[36]. Backward chaining drives the entire process by demanding
that a parameter such as CONTROL CYCLE COMPLETED have 
a value of true. The inference engine works back through the 
rules to identify other parameters that allow this and, where nec- 
essary, triggers side effects like estimation and control to set 
these parameters to the needed values. Backward chaining also is 
invoked to learn the value of ABNORMAL BEHAVIOR DE- 
TECTED, be it true or false. Conversely, forward chaining indi- 
cates what actions can be taken as a consequence of the current 
state. If SENSOR MEASUREMENTS REASONABLE is true, 
and ALARM DETECTED is false, then failure identification and 
reconfiguration side effects can be skipped on the current cycle. 
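
A control cycle driven in this way might be sketched as follows. This is an illustration under simplifying assumptions, not the logic of [36]: each rule only requires that the parameters it depends on have known values, the procedures `estimate`, `detect`, and `control` are hypothetical stand-ins for the real side effects, and the flag `abnormal` simulates the outcome of failure detection.

```python
def run_cycle(abnormal):
    """Run one control cycle; return the ordered list of side effects."""
    log, values = [], {}

    def estimate():
        log.append("estimate")
        return True                      # SENSOR MEASUREMENTS REASONABLE

    def detect():
        log.append("detect")
        return abnormal                  # ABNORMAL BEHAVIOR DETECTED

    def control():
        if values["ABNORMAL BEHAVIOR DETECTED"]:
            log.append("identify and reconfigure")
        log.append("control")
        return True                      # CONTROL CYCLE COMPLETED

    # parameter -> (parameters whose values must be known first, procedure)
    rules = {
        "SENSOR MEASUREMENTS REASONABLE": ([], estimate),
        "ABNORMAL BEHAVIOR DETECTED": (["SENSOR MEASUREMENTS REASONABLE"], detect),
        "CONTROL CYCLE COMPLETED": (["ABNORMAL BEHAVIOR DETECTED"], control),
    }

    def find_value(name):
        """Backward chaining: work back from `name` to what it depends on."""
        if name not in values:
            needed, procedure = rules[name]
            for parameter in needed:
                find_value(parameter)
            values[name] = procedure()   # firing the rule runs its side effect
        return values[name]

    assert find_value("CONTROL CYCLE COMPLETED")
    return log
```

Calling `run_cycle(False)` performs estimation, detection, and control only; calling `run_cycle(True)` inserts the identification and reconfiguration step before control.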

Rules and parameters can be represented as objects or frames 
that have identities and attributes. For example, a rule can be ex- 
pressed as the ordered list (NAME, STATUS, PREMISE, AC- 
TION, ACTION PARAMETERS, PREMISE PARAMETERS, 
TRANSLATION), while a parameter may take the form (NAME, 
USING RULES, UPDATING RULES, ALLOWABLE VALUES, TRANSLATION).
Most of these attributes are self-explanatory. STATUS indicates
the state of the rule, such as "not been tested," "being tested,"
"tested, and premise is true," "tested, and premise is false," or
"tested, and premise is unknown." ALLOWABLE VALUES provides a
mechanism for
detecting false logic. TRANSLATION provides a natural-lan- 
guage explanation for display to the operator. Specific rules and 
parameters are represented by lists in which names and attributes 
are replaced by their values. The attribute lists contain not only 
values and logic but additional information for the inference 
engine. This information can be used to compile parameter-rule-association
lists that speed execution [37].
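
One plausible reading of these attribute lists is sketched below. The classes follow the attribute names given in the text, with PREMISE and ACTION omitted for brevity; the function `compile_associations`, the rule `R1`, and the two parameters are hypothetical, and the way [37] actually compiles its association lists may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    name: str
    premise_parameters: list
    action_parameters: list
    translation: str
    status: str = "not been tested"


@dataclass
class Parameter:
    name: str
    allowable_values: tuple
    translation: str
    using_rules: list = field(default_factory=list)     # rules that test it
    updating_rules: list = field(default_factory=list)  # rules that set it


def compile_associations(rules, parameters):
    """Fill USING RULES and UPDATING RULES from the rule attribute lists."""
    by_name = {p.name: p for p in parameters}
    for rule in rules:
        for name in rule.premise_parameters:
            by_name[name].using_rules.append(rule.name)
        for name in rule.action_parameters:
            by_name[name].updating_rules.append(rule.name)
    return by_name


rules = [
    Rule("R1", ["ALARM DETECTED"], ["FAILURE IDENTIFIED"],
         "If an alarm is detected, identify the failure."),
]
parameters = [
    Parameter("ALARM DETECTED", (True, False), "An alarm condition exists."),
    Parameter("FAILURE IDENTIFIED", (True, False), "The failure is known."),
]
index = compile_associations(rules, parameters)
```

With the associations compiled once, the inference engine can go directly from a parameter to the rules that use or update it instead of scanning the whole rule base on every query.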

Frames provide useful parameter structures for related pro- 
ductions, such as analyzing the origin of one or more failures in a 
complex, connected system [38]. The dependency graph of Fig. 9,
showing relationships between actuators and their power supplies,
can be represented by the random-order list ((OBJECT Name)
(ATTRIBUTE_1 Value_1) (ATTRIBUTE_2 Value_2) (...)), a more
flexible form than the previous structure. In this application,
the (ATTRIBUTE Value) lists are (A-KIND-OF Device),
(ANTERIOR<-OR> Device<s>), (POSTERIOR<-OR> Device<s>),
(CRITICALITY Number), and (UNITS Number).
Frames possess an inheritance property; thus the object
((OBJECT Pivoting Actuator) (A-KIND-OF Actuator)
(ANTERIOR Hydro-Reservoir) (POSTERIOR-OR (Swashplate
Pitching-Link))) lays claim to the properties of ((OBJECT Actuator)
(A-KIND-OF Hydraulic Device) (UNITS (1 2))). A two-step
process estimates the failure state. In local failure analysis,
forward chaining assesses the impact of known malfunctioning 
units, and backward chaining finds possible causes of the 
anomalies. In global failure analysis, local failure models are 
combined, an inclusion property prunes redundant models, and a 
heuristic evaluation based on criticality, reliability, extensiveness, 
implications, level of backtracking, and severity produces a list of 
most likely failure models. 
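
The inheritance property can be sketched with the two frames quoted above, stored here as Python dictionaries. The lookup function `get_attribute` is a hypothetical helper; it shows only inheritance through A-KIND-OF links, not the local or global failure analysis.

```python
# Frames as dictionaries; the two objects are taken from the text above.
FRAMES = {
    "Actuator": {"A-KIND-OF": "Hydraulic Device", "UNITS": (1, 2)},
    "Pivoting Actuator": {
        "A-KIND-OF": "Actuator",
        "ANTERIOR": "Hydro-Reservoir",
        "POSTERIOR-OR": ("Swashplate", "Pitching-Link"),
    },
}


def get_attribute(name, attribute, frames=FRAMES):
    """Look up an attribute, climbing A-KIND-OF links when it is missing."""
    while name in frames:
        frame = frames[name]
        if attribute in frame:
            return frame[attribute]
        name = frame.get("A-KIND-OF")
    return None
```

Here `get_attribute("Pivoting Actuator", "UNITS")` finds no UNITS entry in the object's own frame and so returns the value held by its parent, Actuator.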

Expert systems process lists, so it is not surprising that LISP
(LISt Processing) is the computer language of choice for preliminary
development. However, LISP is not a fast, efficient language
and is ill-suited to real-time applications. Moreover, a
rule-based control system uses numerical algorithms that are most 
effectively coded in languages like Pascal, C, or FORTRAN. 
Consequently, knowledge-base translation from LISP to a proce- 
dural language is a useful (if not necessary) adjunct of rule-based 
control system design. This not only speeds program execution, 
it integrates control and decision-making processes, revealing 
new possibilities for incorporating diagnostic procedures in fail- 
ure detection and identification [39]. 


Figure 9. Dependency graph of a hydraulic control system.

Rule-based control systems must make decisions under
uncertainty, and they can do so either by invoking certainty-equivalent
logic, which is analogous to a well-known concept of
stochastic optimal control, or by uncertainty management in the
decision-making process. In the LQG regulator, uncertainties due
to disturbances and measurement error are processed in the estimator,
and the feedback control law operates on the state estimate
as if it were the actual state [4]. The optimal control gains for the
stochastic and deterministic cases are identical. Because the rule- 
based control system described above makes its best estimates of 
the failure state in the control logic, the expert system controlling 
FDIR can treat these results deterministically, realizing little or no 
improvement from further uncertainty processing. If inner-loop 
estimation is decidedly sub-optimal, uncertainty management can 
help, using probability theory, Dempster-Shafer theory, possibility
theory, certainty factors, or the theory of endorsements [40].
Bayesian belief networks [41], which propagate event probabili- 
ties up and down a causal tree, have particular appeal for failure- 
tolerant control and are being applied in a related program to as- 
sist aircraft crews in avoiding hazards [42]. 
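
The elementary calculation that a belief network repeats at each node is Bayes' rule. The sketch below shows that single step only, not a full network or the method of [41]; the function name and the numerical values in the usage note are hypothetical.

```python
def posterior_failure(prior, p_alarm_given_failure, p_alarm_given_ok):
    """Bayes' rule: probability of failure after an alarm is observed."""
    joint_failure = p_alarm_given_failure * prior
    joint_ok = p_alarm_given_ok * (1.0 - prior)
    return joint_failure / (joint_failure + joint_ok)
```

For instance, `posterior_failure(0.01, 0.95, 0.05)` evaluates 0.0095 / (0.0095 + 0.0495), so a single alarm from a monitor with a 5 percent false-alarm rate raises a 1 percent prior to roughly 16 percent, well short of certainty.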

Teaching the expert system the rules and parameters that generalize
the decision-making process from specific knowledge (the
process of induction) is another concern. Here, we have fol- 
lowed two approaches at Princeton. The first is called rule re- 
cruitment [43], and it involves the manipulation of "dormant 
rules" (or rule templates). Each template possesses a fixed 
premise-action structure and refers to parameters through 
"pointers." Rules are constructed and incorporated in the rule 
base by defining links and modifying parameter-rule-association 
lists. Learning is based on repeated simulations of the controlled 
system with alternative failure scenarios. Learned parameter val- 
ues then can be defined as "fuzzy functions" [44] contained in 
rule premises. The second approach [45] has two parts: analysis 
of variance identifies the factors that make statistically significant 
contributions to the decision metric, and the "ID3" algorithm [46]
extracts rules from the training set by inductive inference. The 
rules take the form of decision trees that predict the performance 
of alternative strategies.
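
The core of ID3 is the choice of the attribute on which to split the training set, made by maximizing information gain. The sketch below shows that selection step only; the training set, its attribute names, and the strategy labels are hypothetical and are not taken from [45].

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def information_gain(examples, attribute, target):
    """Entropy reduction from splitting `examples` on `attribute`."""
    before = entropy([e[target] for e in examples])
    after = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        after += len(subset) / len(examples) * entropy(subset)
    return before - after


def best_attribute(examples, attributes, target):
    """ID3 splits on the attribute with the largest information gain."""
    return max(attributes, key=lambda a: information_gain(examples, a, target))


# Hypothetical training set: which strategy performed best in simulation.
training = [
    {"failure": "sensor", "altitude": "low", "best": "reconfigure"},
    {"failure": "sensor", "altitude": "high", "best": "reconfigure"},
    {"failure": "actuator", "altitude": "low", "best": "robust"},
    {"failure": "actuator", "altitude": "high", "best": "robust"},
]
```

In this training set the failure type separates the two strategies completely while altitude carries no information, so `best_attribute` selects `failure` as the root of the decision tree. A full ID3 implementation would then recurse on each subset.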

Expert systems are incorporated in the FDIR process to accommodate
declarative functions, leaving reflexive functions to
the estimation and control laws [43]. Declarative action requires
a deep understanding of cause and possible effect. Reflexive ac- 
tion is automatic, quickly relating stimulus to response. Both are 
needed in intelligent failure-tolerant control. 

Artificial neural networks consist of nodes that simulate the
neurons and weighting factors that simulate the synapses of a 
living nervous system. They are good candidates for performing 
a variety of reflexive functions in failure-tolerant control systems 
because they are potentially very fast (in parallel hardware implementation),
they are intrinsically nonlinear, they can address
problems of high dimension, and they can learn from experience. 
From the biological analogy, the neurons are modeled as switch- 
ing functions that take just two discrete values; however, 
"switching" is softened to "saturation" in common usage, not 
only to facilitate learning of the synaptic weights but to admit the 
modeling of continuous functions. 

The neural networks receiving most current attention are 
memoryless expressions that approximate functions of the form 

$$ \mathbf{y} = \mathbf{f}(\mathbf{x}) \tag{20} $$

where $\mathbf{x}$ and $\mathbf{y}$ are input and output vectors and $\mathbf{f}(\cdot)$ is the (possibly
unknown) relationship between them. Neural networks can
be considered generalized spline functions that identify efficient 
input-output mappings from observations [47,48]. Rather than 
approximating eq. 20 by a series, an N-layer neural network 
(Fig. 10) represents the function by recursive operations, 

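$$ \mathbf{x}^{(k)} = \mathbf{s}^{(k)}[\mathbf{W}^{(k-1)} \mathbf{x}^{(k-1)}] = \mathbf{s}^{(k)}[\boldsymbol{\eta}^{(k)}], \quad k = 1 \text{ to } N \tag{21} $$

where $\mathbf{y} = \mathbf{x}^{(N)}$ and $\mathbf{x} = \mathbf{x}^{(0)}$. $\mathbf{W}^{(k-1)}$ is a matrix of weighting
factors determined by the learning process, and $\mathbf{s}^{(k)}[\cdot]$ is an
activation-function vector whose elements are scalar, nonlinear
functions $\sigma_i(\eta_i)$ appearing at each network node:

$$ \mathbf{s}^{(k)}[\boldsymbol{\eta}^{(k)}] = [\sigma_1(\eta_1^{(k)}) \ \cdots \ \sigma_n(\eta_n^{(k)})]^T \tag{22} $$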

One of the inputs to each layer may be a unity threshold element 
that biases the activation-function output. 
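
The recursion can be sketched in a few lines: each layer multiplies its input vector by a weight matrix and passes every element of the result through a scalar activation function, and the output of one layer becomes the input of the next. The weights below are hypothetical (in practice they come from learning), the logistic sigmoid is assumed as the activation function, and the third input is fixed at unity to act as the threshold element.

```python
import math


def layer(weights, x):
    """One layer: eta = W x, followed by an element-wise sigmoid."""
    eta = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return [1.0 / (1.0 + math.exp(-e)) for e in eta]


def network(weight_matrices, x):
    """Apply the layers in order; the output of one is the input of the next."""
    for weights in weight_matrices:
        x = layer(weights, x)
    return x


# Hypothetical weights: 2 inputs plus a unity bias input -> 2 hidden -> 1 output.
W0 = [[1.0, -1.0, 0.0], [0.5, 0.5, -0.5]]
W1 = [[1.0, -1.0]]
y = network([W0, W1], [1.0, 1.0, 1.0])
```

Two weight matrices correspond to N = 2, that is, a network with a single hidden layer.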



Figure 10. Backpropagation Feed-Forward Neural Network. 

The sigmoid is commonly used as the artificial neuron. It is a
saturating function defined variously as $\sigma(\eta) = 1/(1 + e^{-\eta})$ for
output in (0,1) or $\sigma(\eta) = (1 - e^{-2\eta})/(1 + e^{-2\eta}) = \tanh \eta$ for output
in (-1,1). Recent results indicate that any continuous mapping
can be approximated arbitrarily closely with sigmoidal networks
containing a single hidden layer (N = 2) [49,50]. It appears that
certain symmetric functions, such as the radial basis function
($\sigma(\eta) = e^{-\eta^2}$) or the derivative of the sigmoid, have even better
convergence properties. Backpropagation learning algorithms for
the elements of $\mathbf{W}^{(k)}$ typically involve a gradient search (e.g.,
[51]), although learning speed and accuracy are improved using 
the extended Kalman filter [52]. The Cerebellar Model Articula- 
tion Controller (CMAC) is an alternative neural network formula- 
tion with somewhat different properties but similar promise for 
application in control systems [53]. 
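
For reference, the activation functions named in this section can be written out directly. The function names are chosen here for illustration; the derivative of the logistic sigmoid uses the standard identity that it equals the sigmoid times one minus the sigmoid.

```python
import math


def logistic(eta):
    """Sigmoid with output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def bipolar(eta):
    """Sigmoid with output in (-1, 1); algebraically equal to tanh(eta)."""
    return (1.0 - math.exp(-2.0 * eta)) / (1.0 + math.exp(-2.0 * eta))


def radial_basis(eta):
    """Symmetric, bell-shaped alternative: exp(-eta^2)."""
    return math.exp(-eta * eta)


def logistic_derivative(eta):
    """Derivative of the logistic sigmoid, also symmetric about eta = 0."""
    s = logistic(eta)
    return s * (1.0 - s)
```

The two sigmoids saturate for large positive or negative arguments, whereas the radial basis function and the sigmoid derivative peak at zero and decay on both sides.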

Equation 20 can represent many functions of importance in
dynamics and control. For example, defining x as [x(t), u(t),
w(t), p], eq. 1 takes that form; together with the implied integration,
neural networks can model plant dynamics. A discrete-time
model of truck dynamics is demonstrated in [54], and a means of
using neural networks in system identification is described in
[55]. With x = [x(t), u(t), w(t), n(t), p], the measurement vector
(eq. 2) also could be represented. There is little advantage to
expressing a linear control law like eq. 7 by a neural network;
however, if the control gain matrix C is scheduled by operating
point or time, that relationship could be modeled by a neural network.
If a nonlinear control function such as u = u(x, x_desired)
is generated by optimization, nonlinear inversion, or model
matching, it can be represented by a neural network (e.g.,
[54,56,57]). Consequently, neural networks can be incorporated
in most of the control and FDIR techniques mentioned above.

Neural networks can be applied to failure detection and identification
by mapping data patterns (or feature vectors) associated
with failures onto detector/identification vectors (e.g., [58,59]).
The network is trained to detect failure with the scalar output "1"
corresponding to all failure patterns and "0" corresponding to no
failure. During operation, a failure is indicated when the output
exceeds some threshold near "1." To identify specific failures,
the output is a vector, with a training value of "1" in the i-th
element corresponding to the i-th failure mode. For M failure modes,
either M neural networks with scalar outputs are employed or a
single neural network with M-vector output is used; there are evident
tradeoffs related to efficiency, correlation, and so on. The
data patterns associated with each failure may require feature
extraction, pre-processing that transforms the input time series into
a feature vector. In [59], this was done by computing 24 Fourier
coefficients of the input signal in a moving temporal window.
When assessing FDI logic, feature extraction must be considered
part of the neural-network computation.
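
Feature extraction and thresholding of this kind can be sketched as follows. The details are assumptions rather than the procedure of [59]: coefficient magnitudes are used as features, the window advances one sample at a time, and the threshold of 0.9 is an arbitrary value "near 1." The neural network itself is omitted; `failure_indicated` and `identify` operate on whatever outputs a trained network would supply.

```python
import cmath
import math


def fourier_features(window, n_coefficients):
    """Magnitudes of the first DFT coefficients of one window of samples."""
    n = len(window)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(window))) / n
        for k in range(n_coefficients)
    ]


def moving_features(signal, width, n_coefficients):
    """Slide a window along the signal, one feature vector per position."""
    return [
        fourier_features(signal[i:i + width], n_coefficients)
        for i in range(len(signal) - width + 1)
    ]


def failure_indicated(network_output, threshold=0.9):
    """Detection: declare a failure when the scalar output is near 1."""
    return network_output > threshold


def identify(network_outputs, threshold=0.9):
    """Identification: index of the failure mode whose output is near 1."""
    best = max(range(len(network_outputs)), key=lambda i: network_outputs[i])
    return best if network_outputs[best] > threshold else None
```

Each feature vector would be presented to the network in place of the raw time series, so the cost of computing it is counted as part of the FDI computation.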

Of course, not all of the suggested neural nets can learn on-line,
as a training set must contain desired outputs as well as
available inputs. In the cited examples, [54] and [56,58,59] use
off-line learning, while [55] and [57] allow on-line learning.
Reference 60 trains a neural network using an expert system that
previously learned the desired control strategy. Once an initially 
trained system is on-line, the "off-line" training process could be 
executed in parallel with the on-line operation, allowing updates 
to be made. If the control process that generates on-line training 
data performs satisfactory control, the need for the neural net- 
work must be questioned. The goal should be to provide satis- 
factory failure tolerance with minimum hardware and software. 

Neural networks intended to detect failures would learn little
from monitoring normally operating plants. In any case, the neu- 
ral-network learning rate is slow, probably too slow to expect 
neural networks of appreciable dimension to adapt to system failures
in real time. Hence, the immediate application of neural
networks in failure-tolerant control systems is in approximating
nonlinear functions used by the FDIR methods introduced earlier.
On-line learning can fine-tune this logic over a period of time. 

Conclusions

Intelligent failure-tolerant control can improve the operating 
characteristics of systems. These improvements depend upon a 
good knowledge of the plant, reliable control elements, and suf- 
ficient observability and controllability following failures. Inher- 
ent robustness, the ability to accommodate failures without adap- 
tation, is a highly desirable attribute, but it may not be sufficient 
to contain all system failures. Because split-second decision and 
reconfiguration may be required, a high degree of pre-training 
should be assumed; even intelligent systems cannot learn about 
new failure modes and respond to them properly at the same time 
(except by chance). Failure-tolerant systems must be able to dis- 
tinguish between failures, disturbances, and modeling errors, re- 
sponding to each in the proper way. Probability theory provides 
an underlying theme that unifies failure-tolerant design, from the
probability of instability of robust systems, through the probabil- 
ity of failure of redundant systems, to the probability of correct 
FDIR response in analytical redundancy. Artificial intelligence is 
a useful adjunct to parallel and analytical redundancy, as expert
systems and artificial neural networks offer new alternatives for 
both declarative and reflexive response to system failures. 